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We apply Markov chain lumping techniques to aggregate codons from an empirical 
substitution matrix. The standard genetic code as well as higher order amino acid substi- 
tution groups are identified. Since the aggregates are derived from first principles they do 
not rely on system dependent assumptions made beforehand, e.g. regarding criteria on 
what should constitute an amino acid group. We therefore argue that the acquired aggre- 
gations more accurately capture the multi-level structure of the substitution dynamics 
than alternative techniques. 

Keywords: Amino acids; Codons; Genetic code; Substitutions; Substitution groups; Evo- 
lution; Aggregation; Lumping. 

1. Introduction 

Ever since its discovery by Nirenberg et al. [T] , the organization of the genetic code 
has been studied extensively. One main theme of research concerns the organization 
with respect to physico-chemical properties of the codons, amino acids and amino 
acid groups. Similar codons are for instance associated with amino acids with sim- 
ilar properties [2] and amino acids with simple structures are typically coded by 
more codons [3]. Many of these studies are based on a snapshot of the code, where 
static information of what codes to what is used. Although amino acids may be 
grouped with respect to specific properties, it is difficult to quantitatively judge 
the relative relevance of these properties. From this standpoint it may prove more 
conclusive to study the evolutionary dynamics acting on codons and amino acids. 
The amino acid substitution process is often modeled as a Markov chain, where 
the distribution of substitutions of a given residue is independent of neighboring 
residues as well as prior residues at the same site. These assumptions are clearly 
false, but meaningfully suffices as approximations. Dayhoff and coworkers estimated 
substitution frequencies empirically from alignments of related sequences [4j. From 
inspection of log odds scores they concluded that amino acids with similar proper- 
ties indeed tend to form groups that are conserved: Members of a group substitute 
with a high frequency internally compared to substitution frequencies to external 
amino acids. This is natural since replacements within such groups are less likely 
to have a harmful effect. By using the transition matrix provided by Dayhoff et 
al. as a direct estimate of similarity, French et al. applied multi-dimensional scaling 
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to demonstrate that amino acids that are Hkely to mutate into each other more 
specifically are correlated with respect to side chain volume and hydrophobicity [5] . 
These findings are in line with more recent work, e.g. by Wu et al |6], who derived 
substitution groups from empirical statistics by the same conservation criterion. 

In this paper we present an alternative and more general approach to amino 
acid substitution groups that is based on the concept of Markov chain lumping [7] . 
For this study we use empirical codon substitution frequencies provided by Schnei- 
der et al. [8]. These were estimated from 17,502 pairwise alignments of orthologous 
sequences from human, mouse, chicken, frog and zebrafish. 8.3 million codons were 
aligned and the substitutions between codons were counted. A substitution proba- 
bility matrix was derived from the resulting counts. For this purpose they utilized 
the current complete vertebrate genome databases from ENSEMBL ^ . By employ- 
ing two complementary Markov lumping techniques [101 E] i we infer aggregated 
representations of Schneider et al.'s substitution probability matrix. These tech- 
niques apply to linear dynamical systems in general (including Markov chains) and 
may therefore in addition be used to analyze hierarchical organization in other types 
of biological systems. Since the lumping criterion is derived from first principles it is 
independent of interpretations of the specific system at hand. Assumptions e.g. re- 
garding amino acid conservation or group isolation, in the specific case of codon 
substitutions, are therefore not necessary. 



2. Method 

Aggregation of a Markov chain means that the state space is partitioned by an ag- 
glomeration of the original states into macro-states. The result is a reduced process 
with fewer states than the original Markov chain. In general an aggregated dynamics 
is not a Markov chain as the reduced process has longer memory than the original 
process. In the special cases when the process over the partitions are a Markov chain 
with the same order as the original process, we say that the aggregation is a lumping. 
An important special case of lumpability is when the dynamics has different time 
scales. Approximate lumps can then defined as meta-stable states and the Markov 
transitions are characterized by much more frequent jumps within the lumps than 
between the lumps. This situation is closely related to the definition of communi- 
ties in networks, a subject that has been extensively studied recently [12l [13]. It 
turns out that the codon substitution process is approximately lumpable since it 
can be organized into meta-stable aggregated states on several levels. See Fig. [1] for 
a schematic illustration of Markov chain lumping of substitution dynamics. 

In general, the degree by which a coarser process fulfills the Markov criteria can 
be measured as the expected mutual information, (I), between the process' past 
and future states, given its current state. Let {si, S2, s„} be the state space of an 
aggregated process. Pi a stochastic variable of the past states preceding Si, and Fi 
a stochastic variable of the subsequent state of s;. The mutual information between 
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Fig. 1. Markov chain lumping; three levels of dynamics, (a) Codon substitutions are modeled as 
a Markov chain. States represent codons and transitions represent substitutions between codons. 
(b) If the codons are aggregated with respect to the amino acids they code for, the new aggregated 
process remains a Markov chain. For instance, since AAA and AAG both code for lysine, they 
can be aggregated into one unit, (c) Specific aggregates of amino acids also exist such that their 
dynamics is Markovian. Lysine and arginine can for example be merged to form one state. 



past and future states, given a current state Si is 

I(P,; F,) - R{Pi) + H(F,) - il{P,,F,), (1) 
where H(P^) is the Shannon entropy 

n 

= - E = MP^ = S,) (2) 

of Pi. ii{Fi) and H(Pi,Fi) of the joint distribution of Pi and Fi are defined analo- 
gously. Then 

n 

{l) = Y,Pr{s.,)l{Pf,Fi), (3) 

i=l 

where Pr(si) is probability that the system is in state Si. The criterion can be used 
to test whether or not a given partition defines a lumping, but it is not necessarily 
useful for constructing the partitions that define lumpings. Since the number of 
possible ways to partition a state space of N states is astronomical even for rela- 
tively small TV it is not feasible to evaluate all partitions. However, using a spectral 
decomposition of the transition matrix it is possible derive lumped Markov chains, 
even when the number of states is relatively large (on the order of 10'^ or more if 
the transition matrix is sparse). This is due to the following observation [TUl ITTj: 
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Fig. 2. (a) Eigenvalues of the codon substitution transition matrix P. A distinct spectral gap 
after the 21 first eigenvalues (marked in red) suggests that the 21 first eigenvectors reveal an 
aggregation, (b) Vector elements of the fourth eigenvector of P are organized in level sets, where 
codons that map to the same amino acid are on the same level (with the exception of serine). All 
the eigenvalues are real because the transition matrix P is reversible. 



Consider n eigenvectors of the transition matrix. These wiU define N points in an n 
dimensional space, where each point is associated with a state in the Markov chain. 
// the N points form n clusters, these clusters define an aggregation, where aggre- 
gates of states are given by corresponding points within clusters. See [Appendix A| 
for details. Spectral lumping tends to work well when the time scale separation 
in the dynamics is strong, or more generally when lumpings are distinct. For the 
codon substitution process we will see that the genetic code is clearly detected using 
the spectral method, whereas aggregations of amino acids cannot be detected by 
the same technique. In the latter case we instead employ a direct agglomeration 
method. Starting with each state in a separate lump, the state space is successively 
aggregated by joining the pair of lumps that gives the best lumping result according 
to the mutual information criterion in Eq. [3l The agglomeration method is expected 
to give good results on the first few levels in the aggregation hierarchy but become 
more unreliably at the more coarse levels. 



3. Results 

To describe the codon substitution process we use the transition matrix, P, provided 
by Schneider et al. [8]. Note that the system has 61 states as substitutions from the 
three stop codons are not considered. The spectrum of P has a clear gap after the 
21 eigenvalue, Fig.[2l^a). This gap indicates that the 21 first eigenvectors may reveal 
an aggregation of the substitution process [10]. By clustering the elements of the 21 
first eigenvectors of P — resulting in 61 points in a 21 dimensional space — 21 distinct 
clusters are acquired. Since the number of eigenvectors equals the number of clusters, 
these define a lumping. As exemplified in Fig. ^h) the clusters show as level sets 
in the individual eigenvectors. Each cluster constitutes codons that are associated 
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Fig. 3. A dendrogram of the result of an agglomeration based on successively joining pairs of states 
or lumps that result in the best lumping with respect to the mutual information measure in Eq. |3] 
The dashed line marks the most significant aggregation, which is also shown in Fig.|4] Si denotes 
serine coded by TCT, TCC, TCA and TCG, and S2 denotes serine coded by AGT and AGC. 
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Fig. 4. Amino acid groups resulting in the most significant lumping {A, T}, {I, M, V} and {E, D} 
as shown in the standard genetic code table. 



with the same amino acid, with the exception of the codons of serine, which are 
divided into two clusters. This unique separation is due to that serine is the only 
amino acid whose codons are not connected with single point mutations (i.e. some 
codons are separated by a Hamming distance larger than one on a hypercube) . The 
aggregation of individual codons to amino acids, which we know as the standard 
genetic code, reflects that the functionality of member codons within an aggregate 
are invariant under mutations as they code for the same amino acid. 

At the aggregated level of amino acid substitution, lumpings are not as clearly 
revealed by the eigenvectors. This is to be expected since the redundancy in the 
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genetic code reflects a much stronger neutrality than possible similarities between 
the amino acids. If the partitioning of the state space is viewed as an optimiza- 
tion problem aiming to minimize the mutual information defined in Eq. [3l then 
there are many almost equivalent minima. In this situation the spectral method 
does not function well. It is nevertheless possible to identify significant lumpings 
of amino acids into families that tend to be conserved by the substitution process. 
Two generic optimization methods are used: Simulated annealing and the agglom- 
eration method described in Section [21 Due to that tryptophan (W) has very low 
mutability and is the least occurring amino acid, a significant two-state lumping 
exists where tryptophan forms one aggregate and the rest of the amino acids form 
another aggregate. To simplify further analysis thryptophan is therefore discarded. 
The result from the agglomerative aggregation is shown in Fig. [3l The most signif- 
icant lumping found using either method has the aggregates {A, T}, {/, M, V} and 
{E, D} and is shown in Fig. [H 

4. Discussion 

Markov lumping based on empirical substitution frequencies in coding DNA se- 
quences provides a natural rationale for grouping codons and amino acids. Due to 
its construction the lumpability condition has a clearer connection to the evolu- 
tionary dynamics than previous aggregation methods suggested in the literature, 
such as multi-dimensional scaling of the transition matrix [5^ or the compactness 
and isolation criteria used in ^ . Under the assumption that the codon substitution 
process can be approximately described as a Markov process, we argue that the 
lumping is an effective method for analyzing the structure of the genetic code and 
its higher level organization into approximately conserved amino acid families. 

One may hypothesize that aggregated dynamics of codon substitutions provide 
information about the origin of the genetic code. There are several theories aiming 
to address the fundamental question on how the code came to be. See Ref. [M] for 
a comprehensive comparison. With the exception of the frozen accident theory by 
Crick [15], these theories couple the evolution of the genetic code primarily with 
physico-chemical properties of the amino acids or evolved biosynthetic pathways. 
Woese [H] , specifically, suggested that the code has evolved by a process of ambigu- 
ity reduction. The idea is that a crude primordial version of the code, where groups 
of codons code for groups of amino acids with resembling properties, evolved into 
the code's current state by a series of refinements. May amino acid groups reflect 
earlier versions of the code? Riddle et al. |17j experimentally searched for a mini- 
mum set of amino acids capable of forming complex protein folds. They found that 
the five amino acids A, G, I, E and K are capable of forming most of the ancient 
SH3 protein domain. Consider the aggregation in Fig. (4] A and I are members of 
separate aggregates, G form its own aggregate; and E and K are in separate aggre- 
gates prior to the last agglomeration leading to the aggregation, Fig. [3l One could 
speculate that some of these aggregates reflect group codons in an earlier version 
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of the code, and that these groups were speciahzed into present day codons. Such a 
view is partly supported Jimenez-Montao's hypothesis on the evolutionary history 
of the code, which is based on group theory and the thermodynamics of codon- 
anticodon interactions |18| . In the proposed evolutionary tree, amino acids within 
aggregates {A,T}, {I,M,V} and {E,D} share the same branches up till the two 
last reassignment of codons. However, a careful analysis would be required in order 
to conclusively relate acquired aggregates to the evolution of the standard genetic 
code and its deviates. 
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Appendix A. Spectral lumping method 

Consider a Markov chain with N states and a probability distribution vector at 
time t denoted as x{t). The N x N transition matrix P = [pij] defines the evolution 
of the probability distribution x{t + 1) = x{t)P. A partition of the state space can 
be defined by a iV x if matrix 11 = [nik], where K is the number of aggregates 
and TTifc = 1 if state i is in aggregate k and otherwise. The matrix 11 consists of 
only O's and I's with exactly one 1 in each row. A necessary and sufficient criterion 
for a partitioning to define a lumping of a Markov chain is that for each state in a 
specific aggregate the total transition probabihty into other aggregates are equal. 
For a transition matrix P this can be expressed as PijWjk being constant for all 
i within an aggregates and k \7\ . Or equivalently, there must exist a, K x K matrix 
Q such that 

pn = no, (A.i) 

in which case the reduced probability distribution is defined by y(t) ~ x{t)Il and 
Q = [qij] is the transition matrix for the lumped process. We note that the left 
hand side of Eq. lA.ll produces a new matrix by P acting on each column in H, 
whereas on the right hand side the matrix is constructed by each column being a 
linear combination of the columns in 11 with linear coefficients qij . This implies that 
there exist a matrix Q satisfying Eq. lA.ll if and only if the columns of 11 span a 
left invariant subspace of P. If P is diagonalizable, which is the case for the codon 
substitution process, then any (right) invariant subspace can be spanned by the 
(right) eigenvectors. 

If the columns of 11 are chosen to be a subset of the eigenvectors, then Q is just 
a diagonal matrix with the corresponding eigenvalues on the diagonal. This choice 
does define a reduction of a linear system but not usually a lumping of a Markov 
chain. The columns of 11 must contain exactly one 1 and the rest of the elements 
must be 0. At the same time the K columns must be linear combinations of a set 
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of K eigenvectors of P. To see how these two requirements can be met, consider an 
example of a lumping matrix that aggregates four states into two. The two columns 
of the n matrix are, (1,0,0,1)"^ and (0,1,1,0)"'". Any linear combination of the 
columns has the structure (a, 6, 6, a)""", i.e. the elements that are lumped into the 
same aggregate are identical. From this observation it is possible to show that a 
necessary and sufficient condition for lumping a Markov chain x{t + 1) = x{t)P 
into K lumps is that there exist K right eigenvectors of P whose elements constant 
within the aggregates [ TT] . 

Given a set of right eigenvectors, ^ , , . . . , Vi^ , a suggested lumping can be 
identified using a clustering algorithm, e.g. K-mean. For each of the N states in the 
Markov chain the elements of the eigenvectors define a K dimensional coordinate. 
States that are in the same aggregate should have elements that are close together in 
all eigenvectors, e.g. as in Fig.[2llb), and therefore form clusters in the isT-dimensional 
space. 

In practice, the lumpability condition (jA.l|l is never fulfilled exactly. It is there- 
fore important to evaluate the quality of an approximate lumping. It is of course 
possible to use Eq. lA.ll directly, for example by defining the quality of a lumping as 
e.g. the norm ||Pn — n(5||2, which should be minimized. However, the fundamental 
idea behind Markov chain lumping is to find aggregated Markov dynamics whose 
order is not increased, i.e. does not have longer memory than the original process. 
Since memory is naturally defined in terms of the mutual information between the 
past and the future conditioned on the present, as in Eq. [31 we use this as a lumping 
measure. 



